DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers

Authors

  • Wei Hu
  • Yanhuang Jiang
  • Guangming Liu
  • Wenrui Dong
  • Guilin Cai
Abstract

Reliability has become a pressing issue for the Tianhe supercomputer series as the systems scale. Proactive fault tolerance based on failure prediction has become an effective way to improve the system’s fault-tolerance capability. Data collection is the basis of failure prediction and has a great impact on prediction accuracy, yet current data collection methods for failure prediction gather only limited data and incur large overhead. This paper presents DDC, a data collection framework for failure prediction in Tianhe supercomputers. DDC adopts a distributed data collection architecture that can fully collect the data related to compute nodes’ health with high efficiency. Tests of DDC running on TH-1A indicate that DDC has low cost and good scalability.

Similar resources

Solving global shallow water equations on heterogeneous supercomputers

The scientific demand for more accurate modeling of the climate system calls for more computing power to support higher resolutions, inclusion of more component models, more complicated physics schemes, and larger ensembles. As the recent improvements in computing power mostly come from the increasing number of nodes in a system and the integration of heterogeneous accelerators, how to scale th...

Scaling up Hartree-Fock calculations on Tianhe-2

This paper presents a new optimized and scalable code for Hartree–Fock self-consistent field iterations. Goals of the code design include scalability to large numbers of nodes, and the capability to simultaneously use CPUs and Intel Xeon Phi coprocessors. Issues we encountered as we optimized and scaled up the code on Tianhe-2 are described and addressed. A major issue is load balance, which is...

High performance computational biology and drug design on TianHe Supercomputers

Extremely powerful computers are needed to help scientists to handle high performance computational biology and drug design problems. The world’s largest genomics institute BGI currently generates 6 TB data each day. The European Bioinformatics Institute (EBI) in Hinxton currently stores 20 petabytes (1 petabyte is 10^15 bytes) of data and back-ups about genes, proteins and small molecules. Tian...

Distributed Sensor Network for meteorological observations and numerical weather Prediction Calculations

Weather prediction generally means solving differential equations on the basis of measured initial conditions, where data from near and distant neighboring points are used in the calculations. It requires the maintenance of expensive weather stations and supercomputers. However, if weather stations are not only capable of measuring but can also communicate with each other, ...

Resilience-Based Framework for Distributed Generation Planning in Distribution Networks

Events with low probability and high impact, which cause heavy damage every year, seriously threaten the health of distribution networks. Hence, the need for more attention to enhancing network resilience and continuity of power supply is felt more than ever all over the world. In modern distribution networks, because of the increasing presence of distributed generation resources, an alternat...

Publication date: 2015